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ABSTRACT 

Motivation: Transcription factor (TF) ChlP-seq datasets have 
particular characteristics that provide unique challenges and 
opportunities for motif discovery. Most existing motif discovery 
algorithms do not scale well to such large datasets, or fail to report 
many motifs associated with cof actors of the ChlP-ed TF. 
Results: We present DREME, a motif discovery algorithm specifically 
designed to find the short, core DNA-binding motifs of eukaryotic 
TFs, and optimized to analyze very large ChlP-seq datasets in 
minutes. Using DREME, we discover the binding motifs of the the 
ChlP-ed TF and many cofactors in mouse ES cell (mESC), mouse 
erythrocyte and human cell line ChlP-seq datasets. For example, in 
mESC ChlP-seq data for the TF Esrrb, we discover the binding motifs 
for eight cofactor TFs important in the maintenance of pluripotency. 
Several other commonly used algorithms find at most two cofactor 
motifs in this same dataset. DREME can also perform discriminative 
motif discovery, and we use this feature to provide evidence that 
Sox2 and Oct4 do not bind in mES cells as an obligate heterodimer. 
DREME is much faster than many commonly used algorithms, scales 
linearly in dataset size, finds multiple, non-redundant motifs and 
reports a reliable measure of statistical significance for each motif 
found. DREME is available as part of the MEME Suite of motif-based 
sequence analysis tools (http://meme.nbcr.net). 
Contact: t.bailey@uq.edu.au 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 
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1 INTRODUCTION 

It is estimated that the human genome contains 1500 transcription 
factors (TFs) that play a key role in regulating transcription by 
binding to the genome alone or in protein complexes (Vaquerizas 
et ah, 2009). Chromatin immunoprecipitation (ChIP) followed 
by high-throughput sequencing (ChlP-seq) is the current method 
of choice for determining the genomic binding locations of a 
TF. The resolution of ChlP-seq location data is such that the 
actual binding location is typically within ~50bp of the predicted 
location (Wilbanks and Facciotti, 2010). A TF ChlP-seq experiment 
generally yields hundreds to tens of thousands of predicted locations 
(often called 'ChlP-seq peaks'). In addition to providing evidence 
of direct regulation by the TF proximity to individual genes, the 
ChlP-seq data provide a rich resource for exploring the in vivo DNA- 
binding affinity of the ChlP-ed TF and cofactor TFs that bind DNA 
in complex or nearby, and for discovering the identities of these 
cofactors. 

Ab initio motif discovery algorithms applied to ChlP-seq peak 
regions can usually discover the DNA-binding motif of the ChlP-ed 



TF. However, most existing algorithms have limitations that make 
them less than ideal for discovering all the cofactor motifs in a 
ChlP-seq dataset. One limitation of many popular algorithms is that 
they do not scale well to inputs containing thousands of sequences. 
As a result, many studies to date have used only a fraction (say 
500 sequences) of the available ChlP-seq peak regions for motif 
discovery, greatly decreasing the chances of discovering motifs for 
infrequent cofactors. A second limitation is that many algorithms do 
not provide reliable motif statistics to enable the biologist to discern 
functional motifs from statistical artifacts. This limitation can cause 
valid motifs to be ignored, or time to be wasted investigating random 
motifs. A third limitation is that most algorithms do not allow two 
ChlP-seq datasets to be directly compared, finding motifs that are 
relatively enriched. This limitation makes it difficult to determine 
whether two TFs always bind as a heterodimer or even if the motif 
for the ChlP-ed TF has been found at all. Other limitations possessed 
by some algorithms that make them unsatisfactory for ChlP-seq data 
analysis include only finding a single motif or being 'hard- wired' to 
use a particular type of sequence (e.g. promoters). 

Here, we describe a novel motif discovery algorithm lacking the 
above limitations, and demonstrate how to use it to mine ChlP- 
seq datasets for primary and cofactor motifs and for insights into 
cooperative or indirect binding by TFs. We show that the new 
algorithm is significantly faster than several algorithms that are 
commonly used for analyzing ChlP-seq datasets, yet is able to 
discover substantially more cofactor motifs. The algorithm searches 
for motifs expressed as simplified 'regular expressions' (consensus 
sequences allowing wildcards but not variable length gaps). It 
achieves speed partly by limiting its search to short motifs (up 
to 8 bp wide), which turns out to be ideal for identifying and 
studying the DNA-binding motifs of monomeric eukaryotic TFs, 
but may cause it to miss some wider motifs due to binding by 
large TF complexes. Our new algorithm has some similarities with 
a few existing, word-based algorithms such as Trawler (Ettwiller 
et al, 2007) and Amadeus (Linhart et al, 2008)), but our approach 
is simpler and we demonstrate that it finds substantially more 
cofactor motifs. Consequently, our algorithm complements rather 
than replaces existing motif discovery tools for the analysis of 
ChlP-seq data. 



2 METHODS 

2.1 DREME: discriminative regular expression motif 
elicitation 

Our design objective for a motif discovery algorithm tailored to eukaryotic 
ChlP-seq data is the ability to quickly discover multiple, short, non- 
redundant, statistically significant, discriminative motifs in extremely large 
sets of short sequences. The focus on short (from 4 to 8 bp) motifs is 
motivated by the observation that this range encompasses the DNA-binding 
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region of most eukaryotic monomeric TFs, and existing algorithms can easily 
find the larger, more information-rich motifs characteristic of binding by TF 
protein complexes. In the interest of computational speed, we restrict the 
search for motifs to a simplified form of 'regular expression': words over 
the IUPAC alphabet, which adds eleven wildcard characters to the standard 
DNA alphabet, ACGT. For example, the binding motif of the Klf family 
of TFs is well represented by the IUPAC regular expression (RE) motif 
CMCRCCC (called a 'CACC-box'), where M matches A or C and R matches 
the purines A or G. 

Our search for motifs is exhaustive for exact words (no wildcards), but 
heuristic for words with wildcards, again in the interest of speed. To identify 
statistically significant, discriminative motifs, we compute the significance 
of the relative enrichment of each motif in two sets of sequences using the 
Fisher's Exact Test. This test computes the probability that the fraction of 
sequences in the first set matching the motif would be as great as observed 
(or greater), given the fraction of matching sequences in the second set. 
In a typical application of our approach, one set of sequences would be a 
set of ChlP-seq peak regions from a TF ChlP-seq experiment, and the other 
would either be similar data from a different ChlP-seq experiment or shuffled 
versions of the first sequences. To avoid problems with self- overlapping 
motifs, we only count the number of sequences containing a motif (not the 
number of occurrences of the motif) in each of the two datasets. However, 
once the motif with the highest significance has been found, all of its non- 
overlapping occurrences in the first set of sequences are aligned to create a 
position- specific probability matrix (Stormo, 2000). Finally, to find multiple, 
non-redundant motifs in a set of sequences, we simply 'erase' the best motif 
found by setting all its occurrences to a special letter that cannot match any 
motif, and then repeat the search for motifs. 

DREME implements this motif discovery approach. The input to DREME 
is two sets of DNA sequences and a significance threshold. The algorithm's 
outer loop performs a heuristic search of RE motifs, determines the most 
significant motif, reports it and erases all its occurrences in the input datasets. 
The outer loop is repeated until no new motif has lvalue less than the given 
significance threshold. 

The central task of the DREME algorithm is searching the space of 
RE motifs. Our approach uses a beam search that starts with a set of 
highly significant 'seed' RE motifs and attempts to find more significant 
generalizations of them. To initialize the beam search, DREME computes the 
significance of each word (no wildcards) of length three to eight that occurs 
in the positive sequences. DREME does this by counting how many positive 
and negative sequences contain each word, and computing the (uncorrected) 
p-value of the Fisher's Exact Test. This is done in time linear in the size of 
the datasets. The 100 most significant words are then passed as 'seed' REs 
to an inner loop that performs the beam search. 

At each iteration of the inner loop, DREME considers all generalizations 
of each seed RE that differ from the seed RE by containing exactly 
one additional wildcard. If all the words that match a generalization of 
the seed RE are significant (uncorrected p<0.01), DREME estimates the 
generalized RE's statistical significance, otherwise, the generalization is 
discarded. To save computation time, DREME estimates the significance of 
candidate REs without scanning the input sequences. When all seed REs have 
been generalized, DREME sorts the new, more general REs by estimated 
significance, and then computes the exact significance of the top 100 to use 
as seed REs in the next iteration. The inner loop stops when no seed RE can 
be generalized to include an additional wildcard, typically after three or four 
iterations. 

To estimate the number of sequences matching a generalized RE, DREME 
uses the fact that this is always equal to the size of the union of the matching 
sets of two more specific REs, as shown in Supplementary Table S 1 . DREME 
assumes that these two matching sets (REi and RE2) are independent samples 
from a set of N sequences, so the size of the union (RE3) is given by 

IRE1HRE2I 

|RE 3 | « IRE1I + IRE2I- - U N , (1) 



where |RE| is the size of the set of sequences matching a given RE. For 
example, if REi = ATGCG is a seed RE, and RE2 = ATGCT, which differs 
from it only in the last position, is significant, then DREME estimates the 
number of positive (or negative) sequences matching the generalization 
RE3 = ATGCK using Equation (1). DREME applies the Fisher's Exact Test to 
the estimated numbers of positive and negative sequences matching ATGCK 
and caches the result for later use in estimating the numbers of sequences 
matching ATGCD and ATGCH. By performing generalizations in the order 
shown in Supplementary Table SI , DREME efficiently estimates the numbers 
of sequences matching a generalized RE. 

2.2 Evaluating motif discovery 

We evaluate DREME and compare it with a number of existing motif 
discovery algorithms using several TF ChlP-seq datasets from mouse 
embryonic stem cells (mES cells), mouse erythrocytes and a human 
lymphoblastoid cell line. In our evaluation, we consider speed, ability to 
identify the primary motif, identification of secondary motifs and illustrate 
the usefulness of being able to do discriminative motif discovery in ChlP-seq 
datasets. To determine what motifs an algorithm discovers, we compare its 
output motifs to a large panel of known motifs using the TOMTOM (Gupta 
et al., 2007) algorithm and use gene expression data from the Gene 
Expression Atlas (Kapushesky et al, 2010) to determine which TF among 
TFs with similar DNA-binding affinity may bind to secondary motifs. 

2.2.7 Running motif discovery algorithms We run DREME using its 
default settings. When only one set of sequences is input, DREME creates 
a dinucleotide-shuffled version of the input sequence set as the negative 
sequence set. For discriminative motif discovery, we provide two sets of 
input sequences to DREME. In both cases, the significance threshold for 
motifs is £ = 0.05. 

We compare the speed and accuracy of several popular motif 
discovery algorithms with that of DREME using the 13 mESC ChlP-seq 
datasets. Two of these algorithms [MEME (Bailey and Elkan, 1995) and 
nestedMICA (Down and Hubbard, 2005)] are extremely slow when run 
on more than 500 sequences, so we randomly choose 500 sequences from 
each ChlP-seq dataset to create datasets of reduced size. We run the other 
three algorithms [WEEDER (Pavesi et al, 2004), Trawler (Ettwiller et al, 
2007) and Amadeus (Linhart et al, 2008)] on the same datasets that were 
used above with DREME. Where possible, we parameterize the programs to 
search for motifs of 4-7 bp. The exceptions are WEEDER up to 8 bp; Trawler 
minimum 4 bp; Amadeus exactly 7 bp. The number of motifs to search for 
is a required parameter for nestedMICA, and we set it to 10 to keep running 
times reasonably short. (It does not finish running after 134h on the full- 
size E2fl dataset, which contains 20699 sequences.) We set the maximum 
number of motifs to find to 20 for MEME; WEEDER, Trawler and Amadeus 
have no such parameter. With MEME, we use a 0-order background model 
and with nestedMICA we use a 0-order four class background model. As 
background sequences for DREME, Amadeus and Trawler, we use 1, 5 and 
10 dinucleotide-shuffled copies, respectively, of the foreground sequences. 
With WEEDER, we use its pre-computed background model for mouse, 
which contains the frequencies of all 8mers in regions 1000 bp upstream of 
genes. Exact command lines and more details are given in the Supplementary 
Material. 

2.2.2 ChlP-seq datasets We use 13 mESC ChlP-seq datasets (Chen 
et al, 2008), 3 mouse erythrocyte datasets (Cheng et al, 2009; Kassouf 
et al, 2010; Tallack et al, 2010) and 1 human lymphoblastoid cell line 
dataset (Ramagopalan et al, 2010). The mESC datasets are for 12 TFs that are 
key to the maintenance of pluripotency, plus CTCF. The mouse erythrocyte 
datasets are for Gatal, Klfl and Tall, key regulators of erythropoiesis. The 
human lymphoblastoid cell line data is for the vitamin D receptor (VDR) and 
includes separate ChlP-seq data for cells before and after stimulation with 
calcitriol. To prepare the datasets for use with motif discovery algorithms, we 
map the (centers of the) ChlP-seq peaks declared by the respective authors 
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to the genome, and extract the 100 bp of genomic sequence centered on each 
peak. [The one exception is Gatal, where we used the peaks declared by 
Tallack et al (2010) in the Cheng et al. (2009) data.] 

2.23 Identifying discovered motifs To determine the success of motif 
discovery, we compare the motifs to a database of known TF motifs using 
TOMTOM (Gupta et al, 2007). The motif database comprises all vertebrate 
motifs in the JASPAR database (Sandelin et al, 2004) (146 motifs) and 
all mouse motifs in the UniProbe database (Berger and Bulyk, 2009) (386 
motifs). Because the two motif reference datasets lack a motif for the key 
pluripotency TF Nanog, we also include an in vitro motif for Nanog taken 
from the literature (Jauch et al, 2008), for a total of 533 motifs. We only 
consider statistically significant TOMTOM predictions (£<0.05, corrected 
for 533 tests). 

Because many TFs have very similar DNA-binding affinities, many of the 
known motifs in this database are similar to each other, so some discovered 
motifs have significant matches to multiple motifs in the database. Therefore, 
in order to determine to which TF a discovered motif may correspond, 
we use prior knowledge from the literature and evidence of increased 
expression in the ChlP-ed tissue type. To resolve multiple matches, we 
take the most significant TOMTOM prediction among matching TFs that 
are either upregulated in, or known to be central regulators of, the ChlP-ed 
cell type. We consider a TF as upregulated in mouse ES cells if it was so 
marked for at least one experiment in cell type EF0_000462 in the Gene 
Expression Atlas (Kapushesky et al, 2010) (http://www.ebi.ac.uk/gxa/). For 
motifs discovered in mouse erythrocytes and human lymphoblastoid cell 
lines, we restrict matching candidate motifs to TFs previously shown to be 
important to transcriptional regulation in those cell types. 

3 RESULTS 

3.1 Discovering motifs in a single ChlP-seq dataset 
using DREME 

3.1.1 Discovering motifs in mouse ES cell ChlP-seq data 
DREME reports between 10 and 33 significant motifs (P<0.05) 
for each of the mESC datasets (Table 1, column 3), and the number 
of significant motifs it finds is highly correlated with the number 
of ChlP-seq peaks in the input dataset. It discovers the ChlP-ed 
(primary) TF motif in 10 of the 13 mESC ChlP-seq datasets (Table 1, 
column 4), and it is the most significant motif found by DREME 
in those 10 cases. In two of the remaining three cases (Nanog 
and E2fl ChlP-seq datasets), DREME finds a highly significant 
motif that is similar to the known motif, but the TOMTOM /?-value 
does not meet our 0.05 threshold due to the large number of 
multiple tests (533 reference motifs). However, the uncorrected p- 
values are quite low (Nanog, p = 0.0021; E2fl, /? = 0.0016). Since 
DREME reports 24 motifs in the Nanog dataset, we could reject a 
slightly different null hypothesis that stated that the known Nanog 
motif was not found (Nanog, p = 0.05, corrected for 24 tests) and 
similarly for E2fl (E2fl, p = 0.04, corrected for 25 tests). In the 
final case (Smadl ChlP-seq dataset), DREME does not find a 
motif that matches the only available in vitro Smad-family motif 
(UniProbe Smad3_primary), nor the motif reported by Chen et al. 
(The complete DREME and TOMTOM output on the mESC datasets 
is given in the Supplementary Material.) 

In addition to finding the primary motif, DREME discovers 
binding motifs for up to 12 cofactors in each of the mESC ChlP-seq 
datasets (Table 1, column 4). Many of the motifs found by DREME 
in each mES cell dataset correspond to 1 of 12 key pluripotency TFs 
(shown in bold font in Table 1), and these predictions correspond 
well to the actual overlap of ChlP-seq peak regions reported by Chen 



Table 1. Primary and cofactor motifs found by DREME in 13 mES cell 
ChlP-seq datasets 



TF 


Peaks 


m 


r 


Cofactor motifs 


CTCF 


39609 


29 


1 


Myc, STAT3, GAB PA 


cMyc 


3422 


12 


1 


STAT3, Egrl 


E2fl 


20699 


25 


2 a 


STAT3, Myc, Klf4, Fox, 
CREB/ATF 


Esrrb 


21647 


29 


1 


Klf4, Sox2, STAT3, Oct4, Myc, 

Rxra, Zic3, Ewsrl 


Klf4 


10875 


26 


1 


STAT3, Esrrb, Sox2, Oct4, Spl, 
Gata3, Myc, Zfpl61 


Nanog 


10343 


24 


4 a 


Sox2, Oct4, Zic3, Klf4, Elf5, 
Esrrb, Teadl 


nMyc 


7182 


21 


1 


STAT3, Smadl, CREB/ATF, 
Sfpib 


Oct4 


3761 


17 


1 


Sox2, Klf4, CREB/ATF, Esrrb 


STAT3 


2546 


13 


1 


Klf4, Esrrb, Sox2, Oct4, Myc, 
Spl,Irf4 


Smadl 


1126 


10 


No 


Sox2, Oct4, Esrrb, Zic3, Klf4, 
Zfp740 


Sox2 


4526 


19 


1 


Oct4, Klf4, STAT3, Zic3, Esrrb 


Tcfcp211 


26910 


33 


1 


STAT3, Klf4, Esrrb, Egrl, Sox2, 
Oct4, Fox, Myc, Spl, Teadl, 
CREB/ATF 


Zfx 


10338 


20 


1 


STAT3, Myc, Esrrb 



Columns show the name of the ChlP-ed TF; the number of ChlP-seq peaks; the number 
of significant motifs (m) found by DREME (£<0.05); the rank (r) of the ChlP-ed 
TF's motif; and cofactor motifs found. Cofactor TFs are listed in the order of DREME 
significance and in bold font if they are 1 of the 12 pluripotency TFs. Only the cofactor 
TF family name is given when several family members match the DREME motif (e.g. 
'Myc'). 

a See text for discussion of E2fl and Nanog motifs. 

et al. (2008). For example, in addition to finding the motif for the 
ChlP-ed factor in the Tcfcp211 dataset, DREME also discovers the 
motifs for pluripotency factors STAT3, Klf4, Esrrb, Sox2, Oct4 and 
c-Myc/n-Myc (Myc-family motifs are essentially indistinguishable). 
In the CTCF dataset, DREME identifies the CTCF motif as well an 
unknown motif that was recently found to frequently occur upstream 
of the CTCF motif (consensus GCTGCAGT) (Boyle et al, 2010) in 
human cells. 

Many of the other motifs found by DREME can be assigned to TFs 
that are known to be upregulated in mES cells, but whose roles in the 
maintenance of pluripotency are unknown (shown in normal font in 
Table 1, column 4). Several of these cofactor TF motifs are found 
in more than one mESC ChlP-seq dataset. These cofactor motifs 
include Spl, CREB/ATF and Zic3, which was recently suggested 
as being required for the maintenance of pluripotency (Lim et al, 
2010). On the other hand, we are currently unable to assign many 
of the motifs discovered by DREME to any known motif. Although 
some of these unassigned motifs may be artifacts, others may simply 
correspond to TFs whose motifs are not contained in the two motif 
databases we searched. 

Several of the motifs discovered by DREME in the mES cell 
datasets differ substantially from those reported by Chen et al. in 
their original data analysis (Chen et al, 2008). Of particular note, 
DREME discovers the monomer motifs for Oct4 and Sox2 in their 
respective ChlP-seq datasets, rather than the Oct-Sox heterodimer 
motif. The in vivo binding motifs discovered by DREME for Oct4 
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(Sox2) (Oct4) (Nanog) (E2fl) 



Fig. 1. Comparison of DREME mESC TF ChlP-seq motifs with in vitro motifs. Each panel shows the logo of the in vivo binding motif discovered by DREME 
in the designated TF ChlP-seq dataset (lower logo) aligned with the logo of the best available in vitro motif (upper logo). Since no in vitro motifs are available 
for Sox2, Oct4 and E2fl, UniProbe motifs for closely related TF family members Soxll, Pou2f3 and E2f3 are used. The in vitro motif for Nanog is taken 
from Jauch et al. (2008). 



and Sox2 closely match the in vitro motifs derived using protein 
binding microarray (PBM) technology (Berger and Bulyk, 2009), 
suggesting that the PBM motifs accurately capture DNA binding 
of these factors in vivo (Fig. la and b). This illustrates a particular 
advantage of focusing on short, core motifs and we discuss this issue 
in detail below. We also explore the hypothesis (suggested by Chen 
et al.) that Oct4 and Sox2 bind DNA exclusively as a homodimer. 
(The fact that the most significant motif found by DREME in each 
of the two ChlP-seq datasets is the primary factor's suggests this 
hypothesis may be false.) 

DREME finds a motif similar to the known E2f 1 motif in the E2f 1 
ChlP-seq dataset. This is in contrast to Chen et al, who report not 
finding a motif for E2fl using either the WEEDER (Pavesi et al., 
2004) or nestedMICA (Down and Hubbard, 2005) motif discovery 
algorithms. A previous study of E2fl ChlP-chip data also failed 
to find a convincing E2fl motif, concluding that E2fl is mainly 
recruited to promoters via a mechanism distinct from recognition 
and binding to the consensus site (Bieda et al, 2006). Nonetheless, 
in the E2f 1 dataset, the second most significant motif discovered by 
DREME is the word ATGGCG, which occurs in 1274 of the 20699 
ChlP-seq peak regions (E = 10 -98 ) and resembles the in vitro motif 
for E2f-family member E2f3 (Fig. Id). The strong enrichment of 
this word in the E2fl ChlP-seq data and its resemblance to the in 
vitro E2f3 motif suggests that it may represent at least a subset 
of the in vivo binding sites of E2fl in mES cells. The DREME 
motif also is similar to the motif for YY1 (JASPAR MA0095.1, 
consensus ATGG), suggesting the hypothesis that the sites discovered 
by DREME might bind both (but not simultaneously) E2f 1 or YY1 
in mES cells. 

3.1.2 Discovering motifs in mouse erythrocyte ChlP-seq data 
When we apply DREME individually to three ChlP-seq datasets 
from mouse erythroid cells [TFs Gatal (Cheng et al., 2009), 
Klfl (Tallack et al, 2010) and Tall (Kassouf et al, 2010)], the 
top motif is the ChlP-ed motif in the first two cases, but not in 
the last case. With the Gatal dataset, DREME finds the Gatal 
motif (£ = 10~ 892 ) followed by the motif of SPI1 (£ = 10~ 79 , a 
key myeloid developmental regulator (Zakrzewska et al, 2010). It 
also finds the Klfl motif (£=10 -37 ) and and E-box motif (E = 
10 -19 ) that might be due to Tall binding. DREME also discovers 
the motif for Runxl (£=10 -2 ), which may also be involved in 
erythropoeisis (Yokomizo et al, 2008). In all, DREME finds 21 
significant motifs in the Gatal dataset. 



DREME finds a total of six significant motifs in the Klfl 
dataset. It identifies the Klfl motif as the most significant (E = 
10~ 49 ), followed by Gatal (E = 10~ 42 ). Interestingly, the third most 
significant motif is the 'secondary' Klf motif discovered from in vitro 
binding data (Berger and Bulyk, 2009) (E= 10 -11 ). 

In the Tall ChlP-seq dataset, the most significant motif found by 
DREME (among 10 motifs found) is Gatal (E= 10 -262 ), whereas 
the actual Tall binding motif (E-box) is the third most significantly 
found motif (E= 10 -26 ). The lower enrichment of the Tall motif 
compared with Gatal may be due to the fact that Tall often binds 
DNA as a complex with Gata, Lmo2, Ldbl and E47 (Wadman et al, 
1997). In this complex, Tall has reduced contact with the DNA, 
binding to a motif consisting of one-half of an E-box followed 
~10bp later by a Gatal site (Kassouf et al, 2010). This 'semi- 
direct' binding would reduce the prevalence of the full Tall motif 
in the Tall ChlP-seq dataset. DREME also finds motifs for Klfl 
(E= 10 -23 ) and SPI1 (E= 10 -16 ) in this dataset. 



3.1.3 Human lymphoblastoid cell line motifs We apply 
DREME to two vitamin D receptor (VDR) ChlP-seq datasets 
from lymphoblastoid cells before and after stimulation with 
calcitriol (Ramagopalan et al, 2010). DREME finds 19 significant 
motifs in the unstimulated cell data, and only 1 1 in the stimulated 
cell data (E < 0.05). VDR binds DNA as a heterodimer with retinoid 
X receptor (RXR) and the left and right halves of the binding motif 
both have the consensus sequence (A/G) (A/G) G (T/G) TCA. 
DREME finds a single motif matching the consensus in the 
ChlP-seq data from the calcitriol- stimulated cells (fourth most 
significant motif, £=10 -11 ). The top-scoring motif in both 
datasets is an ETS-factor motif, possibly due to binding of Ets-1, 
which has been shown to cause VDR to function as a constitutive 
activator (Tolon et al, 2000). The second most significant motif 
in the stimulated cell data is an E-box, which has been proposed 
as being involved in negative regulation by VDR (Kato et al, 
2007). The third most significant motif found in stimulated cell 
data strongly resembles STAT1, which is known to interact 
with VDR (Vidal et al, 2002). The fifth motif found is another 
Ets motif, and the sixth motif is a Klf motif. The seventh and 
eighth motifs significantly resemble Runx2 and Apl/Fos/Jundm2 
(TOMTOM £<0.05), respectively, both of which are known 
to interact with VDR (Marcellini et al, 2010). In all, six of 
the eight most significant motifs found in the stimulated cell 
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( Nanog vs. Oct4 or Sox2) 

Fig. 2. Discriminative motif discovery in mESC ChlP-seq datasets. Panels 

(a) and (b) show the logo of the binding motif discovered by DREME 
in the two designated TF ChlP-seq datasets (lower logo) aligned with the 
logo of a known motif for the ChlP-ed TF (upper logo), (a) Upper logo is 
known Oct4 motif (Pou-family member Pou3f3, UniProbe Pou3f3_3235.2). 

(b) Upper logo is known Sox2 motif (TRANSFAC M01272). (c) Shows the 
most significant motif found by DREME in the Nanog dataset using (top to 
bottom) the shuffled Nanog dataset, the Oct4 dataset or the Sox2 dataset as 
the negative set. 

data appear to represent the ChlP-ed factor (VDR) or likely 
cof actors. 

3.2 Discriminative motif discovery 

3.2.1 Determining if binding is solely via a heterodimer It has 
been suggested that Sox2 and Oct4 bind their targets exclusively 
as a heterodimer in mES cells (Chen et ah, 2008). If this is so, we 
should not observe enrichment of either motif when we use the ChlP- 
seq datasets for these two TFs as the positive and negative inputs to 
DREME. We performed this experiment (using default values for all 
DREME parameters), and found that the Oct4 dataset is enriched in 
Oct4 binding sites compared with the Sox2 dataset, and vice versa. 

Using Oct4 as the positive and Sox2 as the negative input, 
DREME discovers the Oct4 motif (Fig. 2a, £=10 -29 ), but not 
the Sox2 motif (E > 0.05). DREME reports that this motif occurs in 
18% of the Oct4 ChlP-seq peaks, but in only 9% of the Sox2 peaks. 
The Oct4 motif is thus significantly enriched in Oct4 ChiP-seq peaks 
compared with Sox2 peaks. Reversing the roles of the two ChlP-seq 
datasets, DREME finds the Sox2 motif (Fig. 2a, £=10 -158 ), but 
not the Oct4 motif. The Sox2 motif found occurs in 55% of Sox2 
peaks, but in only 28% of the Oct4 peaks. These two results strongly 
suggest that both Oct4 and Sox2 often bind DNA in mES cells alone 
or with other partners, rather than exclusively with each other as a 
heterodimer. 

3.2.2 Finding context-dependent motifs It was recently reported 
that Oct4 may preferentially bind to a different motif with consensus 
ATGCGCAT when not binding near Sox2 (Mason et al, 2010). They 
used Oct4 and Sox2 ChlP-seq data from Marson et al. (2008) to 
create two Oct4 sequence datasets. The first dataset, Oct4woSox2, 



contains only Oct4 regions where there is no Sox2 peak within 5000 
bp. The second dataset, Oc 1 4 wSox2, contains Oct4 regions that are 
within 50 bp of a Sox2 peak. Using the first and second datasets as 
positive and negative inputs, respectively, to their discriminative 
motif discovery algorithm CMF, they discovered the ATGCGCAT 
'Oct-only ' motif. They also noted that the ChlP-seq peaks containing 
this motif are enriched for a different set of transcription factors 
(nMyc, cMyc, E2fl and Zfxl) than the Oct4 peaks containing a 
nearby Sox2 binding site (Nanog, Sox2 and Smadl), suggesting that 
the preferred in vivo binding motif of Oct4 is context dependent. 

We applied DREME in an analogous way and found that it also 
discovers the 'Oct-only' motif when applied discriminatively to 
the Oct4woSox2 and Oct4wSox2 datasets. On these datasets, 
the 'Oct-only' motif is the most significant one found by DREME 
(E = le-47), and DREME runs about 5.4 times faster than CMF (74 
versus 378 s). Interestingly, the 'Oct-only' motif is also among the 17 
motifs found by DREME in the Chen et al. Oct4 ChlP-seq dataset, 
raising the total number of identifiable motifs in that dataset to five 
(Table 1, column 5). 

3.2.3 Refining the search for the ChlP-ed factor's motif 
Sometimes the most significant motif found by DREME is not that 
of the ChlP-ed TF, and it may not even be clear whether its motif 
has been found at all. For example, the most significant motifs 
found by DREME in the Nanog dataset are Oct4 and Sox2. This 
suggests that Nanog might sometimes be binding indirectly via one 
or both of these TFs. We wondered if a discriminative approach — 
looking for motifs enriched in the Nanog dataset relative to the Oct4 
or Sox2 dataset — might increase our confidence in the motif we 
proposed as the Nanog DNA-binding motif (Fig. lc). We, therefore, 
ran DREME using the Nanog mESC ChlP-seq dataset as the positive 
input, and each of the other two TFs' datasets (individually) as the 
negative input. The most significant motif in each of these two cases 
(Nanog versus Oct4, Nanog versuss Sox2) is indeed highly similar 
to the DREME motif we suggest above as the putative Nanog motif 
(Fig. 2c), and both are more similar to the previously published 
in vitro motif for Nanog [ ( C / A ) ( C / A ) ATT A] than the motif found 
by DREME in non-discriminative mode. When we use TOMTOM to 
compare these two new putative Nanog motifs to our two reference 
motif databases plus the TRANSFAC motif database (Matys et al., 
2006), each is most similar to a motif for Pbxl (data not shown). So, 
as noted above, determining whether this motif is indeed the binding 
motif for Nanog, Pbxl or a combination thereof, may require ChlP- 
seq data for Pbxl in mES cells. However, the fact that this is the 
most highly enriched motif in the Nanog dataset relative to both the 
Oct4 and Sox2 datasets suggests that the motif is not merely due to 
the presence of Pbxl binding sites in the Nanog dataset. 

3.3 Comparison with other motif discovery algorithms 

We compare the speed and accuracy of several popular motif 
discovery algorithms with that of DREME using the 13 mESC ChlP- 
seq datasets. On this particular task, DREME finds substantially 
more cof actor motifs than the other five algorithms (Fig. 3 a, column 
4). Although generally somewhat slower than Amadeus and Trawler, 
DREME discovers more than twice as many identifiable cofactor 
motifs on average compared with Amadeus, and almost 10 times 
as many as Trawler. The later two algorithms only find 8 of the 
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Fig. 3. Comparison of motif discovery algorithms, (a) The table shows the 
average number of motifs discovered (AO, number of datasets in which the 
ChlP-ed motif was found (S), the average number of identifiable co-factor 
motifs found (C), and the average running time in seconds of the algorithm 
on the mESC ChlP-seq datasets. Bold font indicates best performance. 
Note: Times for nestedMICA and MEME are for the reduced size datasets 
(0.5 megabase-pairs). (b) The plot shows the running times for DREME, 
Amadeus, Trawler and WEEDER on the full-size mESC ChlP-seq datasets. 
Inset plot is the same data plotted with log scales on both axes. 



are not yet known. Of course, because it is only designed to find 
short, core motifs, DREME is intended only to complement existing 
motif finders (such as those tested here). A complete analysis of a 
TF ChlP-seq dataset should also include a search for longer, more 
complex motifs. 

Our motif discovery algorithm incorporates ideas from several 
existing algorithms (Barash et al., 2001; Sharov and Ko, 2009; 
Sinha and Tompa, 2003). Barash et al. (2001) uses the Fisher's 
Exact Test to measure the significance of enrichment of motifs in 
two sets of sequences. Their motifs are not regular expressions 
but '<5-balls': the set of words that are within a set Hamming 
distance from a given word. This motif definition treats variations 
from the consensus word the same, regardless of position within 
the motif. Real TF motifs are less tolerant of variation in certain 
positions (Matys et al, 2006), and this is better captured by regular 
expressions, which explicitly list all the allowed variations (if any) 
at each motif position. Several algorithms including YMF (Sinha 
and Tompa, 2003) use regular expressions at some stage in the 
search for motifs, but differ from our algorithm in other respects. 
For example, YMF only allows a subset of the IUPAC wildcards and 
scores motifs using a different statistical test (Z- score). Counting the 
number of sequences (not occurrences) is equivalent to the 'Zero or 
One Occurrence Per Sequence' (ZOOPS) model used by numerous 
motif discovery algorithms, and finding multiple motifs by erasing is 
reminiscent of MEME (Bailey and Elkan, 1995). CisFinder (Sharov 
and Ko, 2009) also uses the idea of counting words and computing 
relative enrichment in two sets of sequences, is extremely fast, and 
can find many cof actor motifs in ChlP-seq datasets. Unlike DREME, 
however, it requires a motif clustering step to (partially) remove 
redundant motifs, and cannot be restricted to finding short, core 
motifs. Also, in contrast with DREME, CisFinder reports a /?-value 
for each motif that is not a measure of the significance of the motif, 
but only of a single word (without wildcards) matching the motif. 



13 ChlP-ed motifs, whereas DREME, WEEDER, nestedMICA and 
MEME find 10 (Fig. 3a, column 3). 

WEEDER is much slower than DREME, and its running time 
scales non-linearly with dataset size (Fig. 3b), as do MEME and 
nestedMICA (data not shown). The full-size datasets contain on 
average about 10000 sequences of length 100 bp, and DREME 
completes its analysis in 19min on average, whereas WEEDER 
requires almost 10 times as long. Both MEME and nestedMICA are 
really too slow to handle ChlP-seq datasets containing thousands 
of sequences, failing to finish running after many days. For this 
reason, results in Figure 3a for nestedMICA and MEME are for the 
reduced-size datasets containing 500 000 bp. 

4 DISCUSSION 

Focusing on short, core motifs in TF ChlP-seq datasets appears to be 
a very profitable approach to understanding patterns of DNA binding 
by TFs. Additional insight can also be gained by searching for motifs 
that are relatively enriched in one ChlP-seq dataset compared with 
another. The ab initio discovered motifs can often be associated 
with the probable TF that binds them by comparison to existing 
compendia of TF motifs, using expression in the ChlP-ed tissue and 
prior knowledge as a filter. The large numbers of motifs discovered 
by DREME that can be reliably assigned to TFs suggests that 
unassigned motifs may represent binding by other TFs whose motifs 
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